<HTML>
<!--This file created 11.02.2005 11:35 Uhr by Claris Home Page version 3.0-->
<HEAD>
   <TITLE>SGE Checkpointing Howto</TITLE>
   <META NAME=GENERATOR CONTENT="Claris Home Page 3.0">
   <X-CLARIS-WINDOW TOP=48 BOTTOM=752 LEFT=13 RIGHT=874>
   <X-CLARIS-TAGVIEW MODE=minimal>
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<P><FONT SIZE="+1"><B>Topic:</B></FONT></P>

<P>Checkpointing of serial jobs using the built-in checkpointing
support in SGE.</P>

<P><FONT SIZE="+1"><B>Author:</B></FONT></P>

<P>Reuti, <A HREF="mailto:reuti__at__staff.uni-marburg.de">reuti__at__staff.uni-marburg.de</A>,
Philipps-University of Marburg, Germany</P>

<P><FONT SIZE="+1"><B>Version:</B></FONT></P>

<P>1.1a -- 2005-02-14 First updated release, comments and corrections
are welcome</P>

<P><FONT SIZE="+1"><B>Contents:</B></FONT></P>

<UL>
   <LI>Prerequisites</LI>
   
   <LI>The transparent interface</LI>
   
   <LI>The userdefined interface</LI>
   
   <LI>The application-level interface</LI>
   
   <LI>Integration with the Condor library</LI>
</UL>

<P><FONT SIZE="+1"><B>Note:</B></FONT></P>

<P>This HOWTO complements the information contained in the man pages
of SGE and the Administration Guide.</P>

<P><FONT SIZE="+1"><B>Acknowledgement:</B></FONT></P>

<P>Many thanks to Lip Kian NG (<A HREF="mailto:lkng__at__apstc.sun.com.sg">lkng__at__apstc.sun.com.sg</A>)
from Sun Microsystems Singapore for reviewing this document and
eliminating typos and ambiguities.</P>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Prerequisites</B></FONT></P>

<BLOCKQUOTE><B>Configuration of SGE with qconf or the GUI</B>
   
   <P>You should be familiar with modifying the SGE environment,
   such as the queue definitions and the scheduler configuration.
   Additional information on the SGE checkpoint interfaces is
   available from the "checkpoint" and "sge_ckpt" man pages (make
   sure the SGE man pages are included in your $MANPATH).</P>
   
   <P>While working through this HOWTO, you may find it convenient to
   set the scheduler parameter "schedule_interval" to around
   15 seconds, and optionally set:</P>
   
   <P><B>SGE 5.3</B>: the global configuration "schedd_params" to
   "FLUSH_SUBMIT_SEC=0".</P>
   
   <P><B>SGE 6.x</B>: the scheduler configuration "flush_submit_sec"
   to a value other than "0", for example a small value like "4", so
   that this setting is honoured.<BR>
   </P>
   
   <P><B>Available checkpointing interfaces in SGE</B></P>
   
   <P>You can divide the available checkpointing interfaces into two
   groups: they are either kernel based or need some support by the
   running application, hence called user-level interfaces. This
   document will show some examples on how to implement checkpointing
   with the different user-level checkpoint interfaces. These
   include:</P>
   
   <BLOCKQUOTE>Transparent interface<BR>
      Application-level interface<BR>
      Userdefined interface</BLOCKQUOTE>
   
   <P><B>Goal of checkpointing implementation in your
   applications</B></P>
   
   <P>Some lengthy calculations can run for months. If the
   calculation node dies due to a hardware failure or power loss, you
   will lose all the calculation time and have to restart from the
   beginning. With a checkpoint in place, you can at least restart
   from the point at which it was created.</P>
   
   <P><B>Location of the created checkpoint files</B></P>
   
   <P>You should use a shared file system for the location of the
   written checkpoint files. This way the created checkpoint files
   will be accessible to the restarted job, which will most likely
   run on another node. Depending on your application, this can be just
   the home directory of the executing user or, as used in this
   document, /home/checkpoint. It can of course be a file system
   mounted just for this purpose, but it shouldn't be a shared scratch
   space without redundancy, like a shared scratch space with
   PVFS1.</P>
   
   <P><B>Interval for the created checkpoints</B></P>
   
   <P>Although it depends on your application, you can just create
   checkpoints after each critical step of the calculation without any
   timed interval. If it's more suitable, SGE can trigger the
   creation of checkpoints after a timed interval. The default
   setting in the queue definition for this behaviour is 5 minutes,
   which you should consider only as a setting for testing the SGE
   setup. For real applications that run for weeks, it may be
   sufficient to create a checkpoint every 6 to 24 hours. With
   this setting, the amount of data written to the checkpoint
   directory will not noticeably slow down other traffic to the
   shared file system. Also you will only lose at most the
   last 6 to 24 hours of calculations.</P>
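   <P>The interval is given in the form hours:minutes:seconds, as in
   the default of 00:05:00. As a small sketch (the helper name to_hms
   is made up for this illustration and is not part of SGE), the
   following converts a number of seconds into that form, which may be
   handy when you pick a longer value for "min_cpu_interval":</P>

```shell
#!/bin/sh
# to_hms SECONDS: print a duration as hh:mm:ss.
# Hypothetical helper for this HOWTO, not an SGE command.
to_hms() {
    total=$1
    printf '%02d:%02d:%02d\n' \
        $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

to_hms 300               # the 5 minute default
to_hms $((6 * 3600))     # one checkpoint every 6 hours
to_hms $((24 * 3600))    # one checkpoint per day
```

   <P>Whether 6 or 24 hours is the better choice still depends on how
   much data your application writes and how much recalculation you
   are willing to accept.</P>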
   
   <P><B>Supplied files</B></P>
   
   <P>You will find all scripts and files for the examples in the
   following archive: <A HREF="Checkpoint_Howto_Examples.tar.gz">Checkpoint_Howto_Examples.tar.gz</A>.
   The paths in the scripts have to be adjusted to your installation
   of SGE and to the directory in which you intend to run the scripts.</P>
   
   <P><B>State transition diagrams</B></P>
   
   <P>The behaviour of the checkpoint interfaces can be described as
   a finite state machine. You can find state transition diagrams in the
   Howto "Checkpointing under Linux with Berkeley Lab
   Checkpoint/Restart" on the Howto page at
   gridengine.sunsource.net.</P></BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>The transparent interface</B></FONT></P>

<BLOCKQUOTE><B>Application Characteristic</B>
   
   <P>This interface is used for applications that have built-in
   checkpointing support based on listening to a UNIX signal.</P>
   
   <P><B>Behaviour of this interface</B></P>
   
   <P>This interface will send a signal to the running job to
   initiate a checkpoint, according to the "min_cpu_interval" setting
   in the queue definition. Hence you have to define values for
   "signal", "ckpt_dir" and "min_cpu_interval". Not all parameters in
   the definition of the checkpointing interface will be honoured.
   The needed entries for the checkpointing interface:</P>
   
   <BLOCKQUOTE><PRE>$ qconf -sckpt check_transparent
ckpt_name          check_transparent
interface          transparent
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /home/checkpoint
queue_list         all
signal             usr2
when               xmr      </PRE></BLOCKQUOTE>
   
   <P>Here the "when" condition for the creation of a checkpoint file
   is set to "xmr". So a checkpoint will be created by sending the
   signal usr2 to the job (i.e. the whole process group) when the
   specified time interval (which is defined in the queue definition)
   has elapsed, or when the node goes offline. It's not possible to
   initiate a checkpoint just before the migration of the job, but we
   set the "x" anyway to get the job at least restarted. Creating a
   checkpoint at that moment is only possible in a migration script,
   which is available for application-level checkpointing. For
   demonstration purposes, we set the time interval in the queue to 5
   minutes (for testing you can also lower the value to around 15
   seconds). This is the default, but it is not useful for production,
   where a checkpoint every few hours is sufficient. The queue should
   contain the setting:</P>
   
   <BLOCKQUOTE><PRE>$ qconf -sq vast00
...
min_cpu_interval     00:05:00
...</PRE></BLOCKQUOTE>
   
   <P>The setting for the availability of the checkpointing interface
   depends on the used version of SGE.</P>
   
   <P><B>SGE 5.3</B>: The queue type must be set to CHECKPOINTING,
   and the intended queues added to the queue list in the definition
   of the checkpointing interface.</P>
   
   <P><B>SGE 6.x</B>: Here the definition works the other way round, so
   you have to add the name of the checkpointing interface to the
   queue definition. The type of the queue will automatically be set
   to support checkpointing.<BR>
   </P>
   
   <P><B>First script "check_transparent1.sh" (example 1)</B></P>
   
   <P>Having this setting, let's start with just a bash script,
   demonstrating the checkpoint behaviour. For now we simply write a
   file to the directory pointed to by $SGE_CKPT_DIR. This is set by
   SGE from the above setting of the checkpointing interface:</P>
   
   <BLOCKQUOTE><PRE>#!/bin/bash
# check_transparent1.sh
# (bash is needed for the "for ((...))" loop below)
      
trap 'date &gt;&gt; "$SGE_CKPT_DIR/checkpoint_1"' USR2
      
echo "Script started."
      
for ((i=0; i&lt;1000; i++)) ; do
    sleep 1
done
      
echo "Script finished."
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>Although no actual checkpoint is created yet, you can
   have a look at the files created in $SGE_CKPT_DIR and at the
   output of the script. After some time has elapsed, you can find the
   following:</P>
   
   <BLOCKQUOTE><PRE>$ qsub -ckpt check_transparent check_transparent1.sh
your job 8005 ("check_transparent1.sh") has been submitted
$ ls /home/checkpoint/
checkpoint_1
$ cat /home/checkpoint/checkpoint_1
Tue Dec 21 23:03:35 WET 2004
Tue Dec 21 23:08:35 WET 2004</PRE>
      
      <P>Okay, this is working. So we may now implement a more
      useful version.<BR>
      </P></BLOCKQUOTE>
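   <P>If you want to try the signal handling before submitting
   anything, the same trap can be exercised by hand. The following is
   only a sketch for a local dry run: it assumes bash, uses a temporary
   directory as a stand-in for $SGE_CKPT_DIR, shortens the loop, and
   sends the signal with kill, which SGE would otherwise do after each
   "min_cpu_interval":</P>

```shell
#!/bin/bash
# Local dry run of the USR2 handling, without SGE.
# The temporary directory is a stand-in for the one SGE would provide.
SGE_CKPT_DIR=$(mktemp -d)

worker() {
    # Same trap as in the example script.
    trap 'date >> "$SGE_CKPT_DIR/checkpoint_1"' USR2
    for ((i=0; i<4; i++)) ; do
        sleep 1
    done
}

worker &                     # runs in a background subshell
worker_pid=$!

sleep 1                      # let the worker install its trap first
kill -USR2 "$worker_pid"     # stands in for the signal sent by SGE
wait "$worker_pid"

entries=$(wc -l < "$SGE_CKPT_DIR/checkpoint_1")
echo "Entries in checkpoint file: $entries"
rm -rf "$SGE_CKPT_DIR"
```

   <P>Note that bash runs the trap only after the current "sleep" has
   returned, so the timestamp may appear up to a second after the
   signal was sent.</P>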
   
   <P><B>Second script "check_transparent2.sh" (example 2)</B></P>
   
   <P>For this we have to write some more valuable information to the
   checkpoint file.</P>
   
   <BLOCKQUOTE><PRE>#!/bin/bash
# check_transparent2.sh
# (bash is needed for the "let" command below)
      
trap 'echo $ACTUAL_VALUE &gt; "$SGE_CKPT_DIR/checkpoint_2"' USR2
      
#
# Check whether we are restarted and a checkpoint file is already available.
#
      
if &#91; "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r "$SGE_CKPT_DIR/checkpoint_2" &#93; ; then
    read ACTUAL_VALUE &lt; $SGE_CKPT_DIR/checkpoint_2
    echo "Script restarted with value $ACTUAL_VALUE."
else
    ACTUAL_VALUE=1
    echo "Script started."
fi
      
#
# Start of the program.
#
      
while &#91; "$ACTUAL_VALUE" -le 1000 &#93; ; do
    echo "Processing $ACTUAL_VALUE."
    let ACTUAL_VALUE++
    sleep 1
done
      
echo "Script finished."
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>This time we write the current value of
   the loop counter into the checkpoint file. This is not foolproof
   (if the node crashes at the very moment we are writing a new
   version of the checkpoint file, we would be out of luck), but for
   now it's good enough. So let's look at the created file after at
   least five minutes:</P>
   
   <BLOCKQUOTE><PRE>$ ls ../checkpoint/checkpoint_2 
../checkpoint/checkpoint_2
$ cat ../checkpoint/checkpoint_2 
292</PRE></BLOCKQUOTE>
   
   <P>This seems to work, and we can now try to trigger the restart
   of the job. Instead of pulling the power cord out of the node we
   can use the qmod command which, if the checkpointing interface is
   set up correctly, will kill the running job and restart it
   (i.e. migrate the job).</P>
   
   <BLOCKQUOTE><PRE>$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8016     0 check_tran reuti        r     12/22/2004 00:14:21 vast15     MASTER
$ qmod -s 8016
reuti - suspended job 8016</PRE></BLOCKQUOTE>
   
   <P>After a while, you will notice that the job has been restarted
   and is possibly running on a different node than before:</P>
   
   <BLOCKQUOTE><PRE>$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8016     0 check_tran reuti        Rr    12/22/2004 00:24:06 vast11     MASTER</PRE></BLOCKQUOTE>
   
   <P>You will notice that this job has the same job number as the
   original one, but the status field "Rr" shows you that the
   job was restarted. When you look at the generated output file, you
   will notice something like this:</P>
   
   <BLOCKQUOTE><PRE>$ cat check_transparent2.sh.o8016
...
Processing 570.
Processing 571.
Script restarted with value 292.
Processing 292.
Processing 293.
Processing 294.
Processing 295.
Processing 296.
...</PRE></BLOCKQUOTE>
   
   <P>As expected, the processing continues from the last saved
   state in the checkpoint file. So we can now move the checkpoint
   creation into a working program, and reduce the bash script to
   some setup for handling more than just one checkpoint file per
   user.<BR>
   </P>
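   <P>The restart test used in example 2 can also be collected in a
   small shell function. This is just a sketch: the name start_value
   is made up here, and the default of 0 for an unset $RESTARTED is an
   assumption that lets you try the logic outside of SGE without an
   error from the test command:</P>

```shell
#!/bin/sh
# start_value FILE: print the value the loop should start with.
# Hypothetical helper, modelled on the restart test of example 2.
start_value() {
    # ${RESTARTED:-0} supplies 0 when the variable is unset or empty.
    if [ "${RESTARTED:-0}" -eq 1 ] && [ -r "$1" ] ; then
        read value < "$1"
        echo "$value"
    else
        echo 1
    fi
}

# Try it by hand with a stand-in checkpoint file.
demo_ckpt=$(mktemp)
echo 292 > "$demo_ckpt"
RESTARTED=1
echo "Would start with $(start_value "$demo_ckpt")."
rm -f "$demo_ckpt"
unset RESTARTED
```

   <P>In a job script you could then write something like
   ACTUAL_VALUE=$(start_value "$SGE_CKPT_DIR/checkpoint_2") in place
   of the if/else block.</P>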
   
   <P><B>Third script "check_transparent3.sh" (example 3)</B></P>
   
   <P>We should create a directory in $SGE_CKPT_DIR and pass this
   information to the program, to support more than just one
   checkpointing program at a time. Since the signal is sent
   to the whole process group, we have to ignore it in the
   shell and handle it only in the called program.</P>
   
   <BLOCKQUOTE><PRE>#!/bin/sh
# check_transparent3.sh
      
trap '' USR2
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
if &#91; \! -e "$SGE_CKPT_JOB" &#93; ; then
    mkdir $SGE_CKPT_JOB
fi
      
if &#91; \! -d "$SGE_CKPT_JOB" &#93; ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
#
# Check whether we are restarted.
#
      
if &#91; "$RESTARTED" -eq "1" &#93; ; then
    /home/reuti/checkpoint_program3 -r -d $SGE_CKPT_JOB
else
    /home/reuti/checkpoint_program3 -d $SGE_CKPT_JOB
fi
      
rm -rf $SGE_CKPT_JOB
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>It may not be necessary to give extra options to the program at
   all, as all environment variables are also exported to the
   program. But handling this in the shell keeps the program
   flexible: to run it with another queuing system, only the shell
   scripts have to be adjusted. A sample program in C follows in the file
   checkpoint_program3.c. Please compile with:</P>
   
   <BLOCKQUOTE><PRE>$ gcc -o checkpoint_program3 checkpoint_program3.c -lm</PRE></BLOCKQUOTE>
   
   <P>The key points in the program are:</P>
   
   <UL>
      <LI>Check for valid checkpoint directory</LI>
      
      <LI>Read the old checkpoint file if requested</LI>
      
      <LI>Install a signal handler to process the checkpoint
      request</LI>
      
      <LI>Processing</LI>
      
      <LI>Output of the results</LI>
   </UL>
   
   <P>In this case, the program can also run unchanged
   without SGE or without any checkpointing at all. It's just an option in
   the program. To always have exactly one checkpoint file in the given
   directory, the signal handler routine removes the old one
   only if a new version could be written.</P>
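   <P>The idea of giving up the old checkpoint only after a new one
   was written can be sketched in the shell as well. This is not the
   code of checkpoint_program3.c; the helper write_checkpoint and the
   file names are assumptions for this illustration. The new version
   is written under a temporary name in the same directory and then
   renamed over the old file:</P>

```shell
#!/bin/sh
# write_checkpoint DIR VALUE: replace DIR/checkpoint only after the
# new version has been written. Hypothetical helper for illustration.
write_checkpoint() {
    ckpt_dir=$1
    value=$2
    tmp="$ckpt_dir/checkpoint.new.$$"
    if echo "$value" > "$tmp" ; then
        # Within one file system, mv is a rename of the new file
        # over the old one.
        mv -f "$tmp" "$ckpt_dir/checkpoint"
    else
        # Writing failed: keep the old checkpoint, drop the fragment.
        rm -f "$tmp"
        return 1
    fi
}

# Try it by hand in a temporary directory.
demo_dir=$(mktemp -d)
write_checkpoint "$demo_dir" 297
write_checkpoint "$demo_dir" 594
cat "$demo_dir/checkpoint"
rm -rf "$demo_dir"
```

   <P>With this approach a crash during the write should leave the
   previous checkpoint untouched, although the exact guarantees depend
   on the file system you use for the checkpoint directory.</P>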
   
   <P>Let's submit the job and try to move it to a different machine.
   We may get the following output:</P>
   
   <BLOCKQUOTE><PRE>$ qsub -ckpt check_transparent check_transparent3.sh 
your job 8149 ("check_transparent3.sh") has been submitted
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8149     0 check_tran reuti        r     12/24/2004 11:08:00 vast12     MASTER</PRE></BLOCKQUOTE>
   
   <P>The job is running, and we may check the checkpoint directory,
   whether any file was written up to now.</P>
   
   <BLOCKQUOTE><PRE>$ ls /home/checkpoint/
8149
$ ls /home/checkpoint/8149/
$</PRE></BLOCKQUOTE>
   
   <P>Although there is none up to now, we may try to move it to a
   different machine anyway, because this may also happen with just
   the crash of the node.</P>
   
   <BLOCKQUOTE><PRE>$ qmod -s 8149
reuti - suspended job 8149
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8149     0 check_tran reuti        Rr    12/24/2004 11:09:09 vast15     MASTER
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.</PRE></BLOCKQUOTE>
   
   <P>The program discovered the absence of a valid checkpoint file
   and therefore just starts at the beginning. We now wait
   until a valid file has been written.</P>
   
   <BLOCKQUOTE><PRE>$ ls /home/checkpoint/8149/
checkpoint_3
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
$ qmod -s 8149
reuti - suspended job 8149
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8149     0 check_tran reuti        Rr    12/24/2004 11:14:53 vast12     MASTER         
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
I will try to restart...
Checkpoint file read. Recalculation starts at 297.</PRE></BLOCKQUOTE>
   
   <P>This time the checkpoint file was really processed. The program
   starts again at 297, using the already calculated results. After
   the job has finished, the error file contains the complete
   checkpoint processing log:</P>
   
   <BLOCKQUOTE><PRE>$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
I will try to restart...
Checkpoint file read. Recalculation starts at 297.
Checkpoint creation initiated.
Checkpoint file created to restart with 594.
Checkpoint creation initiated.
Checkpoint file created to restart with 892.</PRE></BLOCKQUOTE>
   
   <P>Although the later written checkpoints weren't used, the whole
   procedure is working, and also the created directory for the job
   specific checkpoint files was deleted correctly:</P>
   
   <BLOCKQUOTE><PRE>$ ls /home/checkpoint/
$</PRE></BLOCKQUOTE>
   
   <P>We may now move on to the next available interface in SGE, with
   a slightly different behaviour.</P></BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>The userdefined interface</B></FONT></P>

<BLOCKQUOTE><B>Application Characteristic</B>
   
   <P>This interface is used for applications that have built-in
   checkpoint support, checkpoint automatically based on
   their own internal logic, and do not depend on external
   factors to initiate a checkpoint.</P>
   
   <P><B>Behaviour of this interface (example 4)</B></P>
   
   <P>This interface is similar to the transparent interface, but in
   this case there is no need to specify a signal at all (specifying
   a signal will lead to the same behaviour as the transparent
   interface):</P>
   
   <BLOCKQUOTE><PRE>$ qconf -sckpt check_userdefined
ckpt_name          check_userdefined
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /home/checkpoint
queue_list         all
signal             none
when               xr</PRE></BLOCKQUOTE>
   
   <P>The creation of the checkpoints is up to the working program.
   There will be no external trigger by SGE to create any checkpoint
   file at all (unless you specify the signal). Even with this setup,
   it's still advantageous to use checkpointing, because SGE will
   restart your job and pass this information along to the job.</P>
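   <P>A program that checkpoints on its own might be organised like
   the following sketch. It is a shell stand-in, not
   checkpoint_program4.c; the function name run_steps, its arguments
   and the interval of 300 steps are assumptions for this
   illustration. The stored value is the next step to be processed, so
   a restart continues exactly at the stored step:</P>

```shell
#!/bin/sh
# run_steps LAST EVERY FILE: process steps up to LAST and write the
# current step to FILE whenever it is a multiple of EVERY.
# Hypothetical stand-in for a self-checkpointing program.
run_steps() {
    last=$1
    every=$2
    file=$3

    # Resume from the checkpoint if one exists, otherwise start at 1.
    if [ -r "$file" ] ; then
        read step < "$file"
    else
        step=1
    fi

    while [ "$step" -le "$last" ] ; do
        if [ $((step % every)) -eq 0 ] ; then
            echo "$step" > "$file"
        fi
        # ... one unit of the real calculation would go here ...
        step=$((step + 1))
    done
}

# Try it by hand: 1000 steps, a checkpoint every 300 steps.
demo_file=$(mktemp)
rm -f "$demo_file"           # start without a checkpoint
run_steps 1000 300 "$demo_file"
cat "$demo_file"             # shows the last checkpoint written
rm -f "$demo_file"
```

   <P>In a real program the interval would rather follow the critical
   steps of the calculation than a fixed step count.</P>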
   
   <P>We can use nearly the same script as in example 3 to start the
   job (without the trap command to catch the signal), but the
   program checkpoint_program4.c has to decide when to write a
   checkpoint. In a real working program this may be after each
   critical step. This may also be easier to implement than writing
   all the data of the running program and its state (maybe in the
   form of a finite state machine) to a file.</P>
   
   <BLOCKQUOTE><PRE>#!/bin/sh
# check_userdefined4.sh
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
if &#91; \! -e "$SGE_CKPT_JOB" &#93; ; then
    mkdir $SGE_CKPT_JOB
fi
      
if &#91; \! -d "$SGE_CKPT_JOB" &#93; ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
#
# Check whether we are restarted.
#
      
if &#91; "$RESTARTED" -eq "1" &#93; ; then
    /home/reuti/checkpoint_program4 -r -d $SGE_CKPT_JOB
else
    /home/reuti/checkpoint_program4 -d $SGE_CKPT_JOB
fi
      
rm -rf $SGE_CKPT_JOB
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>You may notice that here the option -d will also trigger the
   creation of the checkpoint file within the program. As we have
   already walked through the created output in example 3, here is
   just the output of the issued commands:</P>
   
   <BLOCKQUOTE><PRE>$ qsub -ckpt check_userdefined check_userdefined4.sh 
your job 8150 ("check_userdefined4.sh") has been submitted
$ qstat -u reuti             
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8150     0 check_user reuti        r     12/24/2004 11:43:15 vast12     MASTER         
$ ls /home/checkpoint/
8150
$ ls /home/checkpoint/8150/
$ qmod -s 8150
reuti - suspended job 8150
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8150     0 check_user reuti        Rr    12/24/2004 11:47:31 vast22     MASTER         
$ cat check_userdefined4.sh.e8150 
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
$ ls /home/checkpoint/8150/                         
checkpoint_4
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
$ qmod -s 8150
reuti - suspended job 8150
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8150     0 check_user reuti        Rr    12/24/2004 11:59:46 vast12     MASTER         
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
I will try to restart...
Checkpoint file read. Recalculation starts at 600.
$ qstat
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
I will try to restart...
Checkpoint file read. Recalculation starts at 600.
Checkpoint creation initiated.
Checkpoint file created to restart with 900.  </PRE></BLOCKQUOTE></BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>The application-level interface</B></FONT></P>

<BLOCKQUOTE><B>Application Characteristics</B>
   
   <P>This interface is usually used for applications that require an
   external process to initiate a successful checkpoint, for example
   by passing a message or by moving the checkpoint file to the shared
   directory.</P>
   
   <P><B>Behaviour of this interface (example 5)</B></P>
   
   <P>This interface will execute the defined commands for
   ckpt_command, migr_command and clean_command. The restart_command
   will not be honoured, since this is reserved for the implemented
   kernel level checkpointing interfaces. You may also specify just
   signal names instead of procedures, but in our example we will use
   procedures to take care of the checkpoint files. Optionally
   you may also specify a signal, which we do not use here either.</P>
   
   <BLOCKQUOTE><PRE>$ qconf -sckpt check_application-level
ckpt_name          check_application-level
interface          application-level
ckpt_command       /home/reuti/check/checkpoint.sh $ckpt_dir $job_id
migr_command       /home/reuti/check/migrate.sh $job_pid $ckpt_dir $job_id
restart_command    none
clean_command      /home/reuti/check/clean.sh $ckpt_dir $job_id
ckpt_dir           /home/checkpoint
queue_list         all
signal             none
when               xmr</PRE></BLOCKQUOTE>
   
   <P>In this setup, the command defined for "ckpt_command" will be
   called to create a checkpoint file whenever the
   "min_cpu_interval" has elapsed. The "migr_command" procedure will be
   executed if you suspend the job or the queue in which the job is
   running (remember that this can be done by hand using qmod, or
   automatically by exceeding a "suspend_thresholds" value set for the
   queue). Finally, the "clean_command" will remove the created
   checkpoint files in case of a qdel or successful completion of the
   job.</P>
   
   <P>Such an interface may still be useful if you only have the
   binary of a program which creates checkpoint files or
   can use some kind of scratch file for a restart, but always
   writes it to a temporary directory like $TMPDIR, and you have no
   option to rename it or save it in a different location. Another
   application of this interface is the case where
   checkpoint or scratch files are written too often or are too big.
   In these cases it's advantageous to write to a local file system
   first, to minimize the network traffic. The checkpointing process
   can then copy the local checkpoint file from e.g. $TMPDIR to the
   shared checkpoint directory.</P>
   
   <P>For the example, we will reuse a slightly modified version of the
   program from example 4, but write a checkpoint after every 5 steps.
   Since this is too often for the shared checkpoint directory, we
   will use the script defined for checkpoint creation in the
   application-level interface to copy this file less often to the
   shared checkpoint directory. For the restart, this copied file will
   be used, because all the files in the scratch directory $TMPDIR are
   of course deleted when the job is removed from the node.</P>
   
   <P>Although the copy of the checkpoint file in the
   "min_cpu_interval" may be sufficient, we can get a more recent
   version when we start the migrate command and perform another
   checkpoint copy. In case of a failure of the node, this won't be
   possible of course. So the necessary checkpoint.sh script may look
   like:</P>
   
   <BLOCKQUOTE><PRE>#!/bin/sh
#
# checkpoint.sh
#
      
#
# Copy a checkpoint file from the scratch directory of the
# job to the checkpoint directory.
#
      
me=`basename $0`
      
# test number of args
if &#91; $# -ne 2 &#93;; then
   echo "$me: got wrong number of arguments" &gt;&amp;2
   exit 1
fi
      
SGE_CKPT_JOB=$1/$2
      
if &#91; \! -e "$SGE_CKPT_JOB" &#93; ; then
    mkdir $SGE_CKPT_JOB
fi
      
if &#91; \! -d "$SGE_CKPT_JOB" &#93; ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
while &#91; -z "$DONE" &#93; ; do
    cp $TMPDIR/checkpoint_5 $SGE_CKPT_JOB/checkpoint 2&gt;/dev/null
    if &#91; "$?" -eq "0" &#93; ; then
        diff $TMPDIR/checkpoint_5 $SGE_CKPT_JOB/checkpoint 1&gt;/dev/null 2&gt;&amp;1
        if &#91; "$?" -eq "0" &#93; ; then
            DONE="1"
        fi
    else
        DONE="1"
    fi
done
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>To get a valid copy of the checkpoint file, we do a simple test
   here by a diff. You may have to do another form of the copy
   command with your real application. The two arguments given to the
   procedure will be used to create the name of the checkpoint
   directory and create it in the first call to the procedure. The
   next procedure is to initiate the migration of the job. In this
   type of checkpoint interface, you have to kill the job on your
   own. We do this with a kill -9 to the whole process group.</P>
   
   <BLOCKQUOTE><PRE>#!/bin/sh
#
# migrate.sh
#
      
me=`basename $0`
      
# test number of args
if &#91; $# -ne 3 &#93;; then
   echo "$me: got wrong number of arguments" &gt;&amp;2
   exit 1
fi
      
#
# Get the current checkpoint besides the regular copied one.
#
      
/home/reuti/check/checkpoint.sh $2 $3
      
#
# Now kill the job with the whole process group.
#
      
kill -9 -- -$1</PRE></BLOCKQUOTE>
   
   <P>The last step is to remove the created checkpoint directory
   after the job. This is done in an appropriate procedure which will
   be executed at the end of the job.</P>
   
   <BLOCKQUOTE><PRE>#!/bin/sh
#
# clean.sh
#
      
#
# Delete the checkpoint directory for this job.
#
      
me=`basename $0`
      
# test number of args
if &#91; $# -ne 2 &#93;; then
   echo "$me: got wrong number of arguments" &gt;&amp;2
   exit 1
fi
      
SGE_CKPT_JOB=$1/$2
      
if &#91; \! -d "$SGE_CKPT_JOB" &#93; ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
rm -rf $SGE_CKPT_JOB 
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>Having defined all this, we may submit a script for this type
   of checkpointing interface with:</P>
   
   <BLOCKQUOTE><PRE>#!/bin/sh
# check_application-level5.sh
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
#
# Check whether we are restarted.
#
      
SGE_CKPT_FILE=$SGE_CKPT_JOB/checkpoint
      
if &#91; "$RESTARTED" -eq "2" -a -e "$SGE_CKPT_FILE" -a -r "$SGE_CKPT_FILE" &#93; ; then
    /home/reuti/checkpoint_program5 -r $SGE_CKPT_FILE -d $TMPDIR
else
    /home/reuti/checkpoint_program5 -d $TMPDIR
fi
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>Please note that in this case the value of $RESTARTED is 2.
   We also take care of the situation where a node crashed before the
   first checkpoint file was successfully created. During the
   restart, you may look into the defined checkpoint directory and
   check if the job's corresponding subdirectory contains any
   checkpoint file. After the job completes or gets deleted (by
   qdel), the job's corresponding checkpoint directory is removed by
   the procedure clean.sh. For this example, ensure that the correct
   name of the checkpointing interface is specified, as it's different
   from the previous examples:</P>
   
   <BLOCKQUOTE><PRE>qsub -ckpt check_application-level check_application-level5.sh</PRE></BLOCKQUOTE></BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Integration with the Condor library</B></FONT></P>

<BLOCKQUOTE><B>Checkpointing with Condor in general</B>
   
   <P>Another queuing system is Condor, which in addition supplies
   built-in checkpointing libraries (<A HREF="http://www.cs.wisc.edu/condor">http://www.cs.wisc.edu/condor</A>).
   It is possible, however, to use just the Condor libraries in a
   standalone mode, without any queuing system at all or with another
   one. This way you can still use SGE and add the checkpointing
   facility of Condor. After downloading the appropriate Condor
   version for your system, you have to install it in personal mode
   just to get access to the compiler. So you may install it in a
   subdirectory inside your home directory, e.g. ~/local:</P>
   
   <BLOCKQUOTE><PRE>$ ls -d1 condor*
condor-6.6.7-linux-x86-redhat80.tgz
$ tar -xzf condor-6.6.7-linux-x86-redhat80.tgz 
$ ls -d1 condor*
condor-6.6.7
condor-6.6.7-linux-x86-redhat80.tgz
$ cd condor-6.6.7/
$ ./condor_configure --install=/home/reuti/condor-6.6.7/release.tar \
  --install-dir=/home/reuti/local/condor-6.6.7 --make-personal-condor --verbose
$ cd</PRE></BLOCKQUOTE>
   
   <P>Once this is done, you can remove the directory into which the
   archive was unpacked (condor-6.6.7 in your home directory in this
   example). To get access to the compiler, two environment variables
   have to be set. The commands can be put into any of the usual
   files that are sourced during login.</P>
   
   <BLOCKQUOTE><PRE>$ export CONDOR_CONFIG=/home/reuti/local/condor-6.6.7/etc/condor_config
$ export PATH=/home/reuti/local/condor-6.6.7/bin:$PATH</PRE></BLOCKQUOTE>
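   <P>Whether the two settings are in effect in the current shell
   can be checked with a few lines of shell. The following is a
   minimal sketch: the function name condor_env_ok is made up for
   this example, and it only tests that $CONDOR_CONFIG names a
   readable file and that condor_compile is found through $PATH, not
   that the installation itself is complete.</P>

```shell
# condor_env_ok   (hypothetical helper, not part of Condor or SGE)
# Returns 0 when CONDOR_CONFIG names a readable file and condor_compile
# can be found through PATH. Otherwise prints a short reason and returns 1.
condor_env_ok() {
    if [ ! -r "${CONDOR_CONFIG:-}" ] ; then
        echo "CONDOR_CONFIG is unset or does not name a readable file"
        return 1
    fi
    if ! command -v condor_compile > /dev/null ; then
        echo "condor_compile not found in PATH"
        return 1
    fi
    return 0
}
```

   <P>Calling condor_env_ok at the top of a build script gives an
   early and readable failure instead of a "command not found" later
   on.</P>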
   
   <P>If everything was installed successfully, the compiler can be
   used right away. Further details can be found in the online Condor
   manual; here we just compile a small program to see whether the
   setup works:</P>
   
   <BLOCKQUOTE><PRE>$ cat ever.c
int main(void)
{
    float x;
    long  i;
      
    for (;;)
    {
        for (i=0;i&lt;=100000;i++)
            x=3.1415926*i+i+i*i*2.7182818;
    }
      
    return 0;
}
$ condor_compile gcc ever.c -o ever
...
$ ls -l ever  
-rwxr-xr-x    1 reuti    users    15199008 Dec 25 22:18 ever
$ strip ever
$ ls -l ever
-rwxr-xr-x    1 reuti    users     1332132 Dec 25 22:18 ever</PRE></BLOCKQUOTE>
   
   <P>The executables created by Condor are usually very big; they
   can be reduced in size with the command strip, as in the example
   above. Before integrating this into SGE, we execute the short
   example to see it working.</P>
   
   <BLOCKQUOTE><PRE>$ ./ever
Condor: Notice: Will checkpoint to ./ever.ckpt
Condor: Notice: Remote system calls disabled.
^C
$ ./ever -_condor_restart ever.ckpt
Condor: Notice: Will restart from ever.ckpt</PRE></BLOCKQUOTE>
   
   <P>Before we kill the program with Ctrl-C, we log in to the
   machine in another terminal session and send a "kill -USR2
   &lt;pid&gt;" to the running "ever" program. SIGUSR2 is the signal
   Condor uses to force the creation of a checkpoint file. The
   process list can be retrieved with "ps -e f". By default the
   checkpoint is written to a file named after the executable with
   .ckpt appended, but a different name can be set with the option
   "-_condor_ckpt &lt;filename&gt;". For a restart it is sufficient
   to give the option "-_condor_restart &lt;filename&gt;", because
   this also sets the filename for further checkpoints. In fact, an
   additional option "-_condor_ckpt &lt;filename&gt;" will be ignored
   in this case.</P>
   
   <P>Condor also provides function calls that can be used inside a
   program; they are explained in section 4.2.4 (Checkpoint Library
   Interface) of the Condor user manual.</P>
   
   <P>Please note that there are some restrictions on the programs
   that can be compiled with Condor; a full list can be found in
   section 1.4 (Current limitations) of the Condor user manual.<BR>
   </P>
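   <P>The manual test with SIGUSR2 described above can also be
   scripted. The following is a sketch under some assumptions: the
   function name request_checkpoint is made up for this example, the
   default of 30 tries is an arbitrary choice, and no older
   checkpoint file may exist under the given name, because the
   function would otherwise return at once.</P>

```shell
# request_checkpoint PID FILE [TRIES]   (hypothetical helper)
# Sends SIGUSR2 to PID, then checks once per second, at most TRIES times
# (default 30), whether FILE exists and is non-empty.
# Returns 0 when the file showed up, 1 on a failed kill or on timeout.
# Assumption: FILE does not exist yet when the function is called.
request_checkpoint() {
    pid=$1
    file=$2
    tries=${3:-30}
    kill -USR2 "$pid" || return 1
    while [ "$tries" -gt 0 ] ; do
        if [ -s "$file" ] ; then
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

   <P>For the "ever" program started in its own directory, the call
   would be request_checkpoint &lt;pid&gt; ./ever.ckpt.</P>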
   
   <P><B>Integrating Condor with the transparent checkpoint interface
   (example 6)</B></P>
   
   <P>Now we will integrate the plain C program "condor_program6.c"
   into SGE. Because we are not using any library functions of Condor
   in this case, we can put the whole restart procedure into the job
   script:</P>
   
   <BLOCKQUOTE><PRE>#!/bin/sh
# condor_transparent6.sh
      
trap '' USR2
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
if &#91; \! -e "$SGE_CKPT_JOB" &#93; ; then
    mkdir $SGE_CKPT_JOB
fi
      
if &#91; \! -d "$SGE_CKPT_JOB" &#93; ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
#
# Check whether we are restarted.
#
      
SGE_CKPT_FILE=$SGE_CKPT_JOB/checkpoint_6
      
if &#91; "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_FILE" -a -r "$SGE_CKPT_FILE" &#93; ; then
    /home/reuti/condor_program6 -_condor_restart $SGE_CKPT_FILE
else
    /home/reuti/condor_program6 -_condor_ckpt $SGE_CKPT_FILE
fi
      
rm -rf $SGE_CKPT_JOB
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>The corresponding program can be compiled with</P>
   
   <BLOCKQUOTE><PRE>condor_compile gcc condor_program6.c -o condor_program6 -lm; strip condor_program6</PRE></BLOCKQUOTE>
   
   <P>You will notice that this program is really short and contains
   no reference to checkpointing at all, because this is handled by
   the Condor library. Because the "sleep(1)" call used in the
   previous examples is not allowed in Condor programs, we have to
   replace it with an active loop. We can reuse the transparent
   interface already defined for the examples above and submit the
   job with:</P>
   
   <BLOCKQUOTE><PRE>$ qsub -ckpt check_transparent condor_transparent6.sh
your job 8188 ("condor_transparent6.sh") has been submitted</PRE></BLOCKQUOTE>
   
   <P>After the first checkpoint file has been created, we can
   suspend the job to move it to a different node.</P>
   
   <BLOCKQUOTE><PRE>$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8188     0 condor_tra reuti        r     12/25/2004 16:24:32 vast11     MASTER         
$ ls ../checkpoint/8188/
checkpoint_6
$ qmod -s 8188
reuti - suspended job 8188
$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
   8188     0 condor_tra reuti        Rr    12/25/2004 16:30:50 vast12     MASTER         
$ cat condor_transparent6.sh.e8188
Condor: Notice: Will checkpoint to /home/checkpoint/8188/checkpoint_6
Condor: Notice: Remote system calls disabled.
Condor: Notice: Will restart from /home/checkpoint/8188/checkpoint_6
      </PRE></BLOCKQUOTE>
   
   <P>This works seamlessly with your applications. Because Condor
   tries to use exactly the same environment and files on a restart
   as during the original run, you can't use any files in $TMPDIR,
   because this directory will have a different name when the job
   starts on the next node. You are therefore limited to your home
   directory, any other directory shared across the nodes, or a
   shared scratch space such as the one supplied by PVFS{1,2}.<BR>
   </P>
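   <P>One way to get a working directory whose name stays the same
   after a migration is to derive it from the job id below a shared
   directory. This is a minimal sketch: the function name job_workdir
   is made up for this example, and the base directory passed to it
   is assumed to be visible under the same path on all nodes.</P>

```shell
# job_workdir BASE JOB_ID   (hypothetical helper)
# Creates the directory BASE/JOB_ID if it does not exist yet and prints
# its name. Calling it again after a restart yields the same path, which
# is what $TMPDIR cannot provide. BASE must be on a shared file system.
job_workdir() {
    dir=$1/$2
    mkdir -p "$dir" || return 1
    echo "$dir"
}
```

   <P>In a job script this could be used as cd "$(job_workdir
   "$HOME/scratch" "$JOB_ID")" before the program is started, where
   the directory scratch is only an example name.</P>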
   
   <P><B>Integrating Condor library functions with the transparent
   checkpoint interface (example 7)</B></P>
   
   <P>You are a bit more flexible in using Condor when you integrate
   the library functions directly into your program. This way the job
   script contains no hint of Condor at all:</P>
   
   <BLOCKQUOTE><PRE>#!/bin/sh
# condor_transparent7.sh
      
trap '' USR2
      
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
      
if &#91; \! -e "$SGE_CKPT_JOB" &#93; ; then
    mkdir $SGE_CKPT_JOB
fi
      
if &#91; \! -d "$SGE_CKPT_JOB" &#93; ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
      
#
# Check whether we are restarted.
#
      
if &#91; "$RESTARTED" -eq "1" &#93; ; then
    /home/reuti/condor_program7 -r -d $SGE_CKPT_JOB
else
    /home/reuti/condor_program7 -d $SGE_CKPT_JOB
fi
      
rm -rf $SGE_CKPT_JOB
      
exit 0</PRE></BLOCKQUOTE>
   
   <P>This time we have to supply the necessary functions in the
   program. The lines accessing the Condor library are
   "init_image_with_file_name(checkpoint_read);" and "restart();".
   The generation of a checkpoint is still initiated by a SIGUSR2
   signal, so we don't have to program it by hand. A sample execution
   may give these results in the standard error file:</P>
   
   <BLOCKQUOTE><PRE>$ cat condor_transparent7.sh.e8189
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
I will try to restart...</PRE></BLOCKQUOTE>
   
   <P>Because no checkpoint file had been written when the job was
   suspended for the first time, the program detects this and starts
   from the beginning. The second suspend restarted the program from
   its checkpoint.<BR>
   </P>
   
   <P><B>Integrating Condor library functions with the userdefined
   checkpoint interface (example 8)</B></P>
   
   <P>This way we can also control the creation of a checkpoint file
   from within the program instead of relying on the SIGUSR2 method.
   The necessary call is "ckpt();", placed in the routine that
   handled the writing of the checkpoint file all by itself in the
   examples without the Condor library. After compiling the program
   condor_program8.c in the same fashion as in the previous examples,
   you have to submit the job with the checkpointing interface set to
   check_userdefined, as in the examples that do not use the Condor
   library.</P>
   
   <P></P></BLOCKQUOTE>
</BODY>
</HTML>
